Reverse Data Management

Authors

  • Alexandra Meliou
  • Wolfgang Gatterbauer
  • Dan Suciu
Abstract

Database research mainly focuses on forward-moving data flows: source data is subjected to transformations and evolves through queries, aggregations, and view definitions to form a new target instance, possibly with a different schema. This forward paradigm underpins most data management tasks today, such as querying, data integration, and data mining. We contrast this forward processing with Reverse Data Management (RDM), where the action needs to be performed on the input data, on behalf of desired outcomes in the output data. Some data management tasks already fall under this paradigm, for example updates through views, data generation, and data cleaning and repair. RDM is, by necessity, conceptually more difficult to define, and computationally harder to achieve. Today, however, as increasingly more of the available data is derived from other data, there is a growing need to modify the input in order to achieve a desired effect on the output, motivating a systematic study of RDM. We define the Reverse Data Management problem and classify RDM problems into four categories. We illustrate known examples of RDM problems and classify them under these categories. Finally, we introduce a new type of RDM problem, How-To Queries.

1. DATA TRANSFORMATIONS

Informally, a data transformation consists of a function from an input data source to an output data source. The natural evolution of data follows the directionality of the transformations, i.e. from source to target. Most data management tasks fall under this forward paradigm from a variety of perspectives: query processing, data integration, data mining, clustering, and indexing. We study here a class of problems that focus on the reverse direction, i.e. against the direction of the data transformation (Fig. 1). In these problems one wants to achieve a certain effect in the output data, and needs to act on the input data in order to achieve that effect.
Examples include updating through views [12], data generation [8], causality computation [26], and data cleaning [3]. We thus refer to these areas under the common term of Reverse Data Management, or RDM. RDM consists of the problems where one needs to compute a database input, or modify an existing database input, in order to achieve a desired effect in the output. All these problems share a common premise: they essentially reverse a transformation in order to achieve a desired target instance, or target properties. Our goal in this paper is to identify the commonalities and differences among these problems, based on the specifications of the problem requirements, and to propose a systematic study of RDM. We define a taxonomy of RDM problems, categorizing them into four groups, allowing us to identify and describe a new type of RDM that we call How-To queries.

Figure 1: Reverse Data Management reasons from a desired output instance or specification, to the required input. (The figure shows a data transformation mapping a source/input to a target/output, with RDM pointing in the reverse direction.)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th to September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 12. Copyright 2011 VLDB Endowment 2150-8097/11/08... $10.00.
Often users, administrators, and analysts are interested in changing their data in ways that would achieve certain conditions and constraints: “What advertisements will result in the best sales increase, at a cost bounded by X?”, “How can I increase my clients’ return on investment with the minimum number of trades?”. How-To queries are a natural extension of the reverse data management space, as we will observe in the following section, and are useful for strategy decisions and for modeling various data optimization problems.

RDM is more difficult to define and to implement than direct data management, because of the simple fact that the inverse of a function is not necessarily a function. Given a desired output, or a desired change of the output, there may be multiple inputs (or none at all) that satisfy it. This difficulty shows up in all RDM problems: in updates through views one restricts attention to the simplest updateable views [12] or searches for updates that minimize the number of side-effects [10]; data cleaning and repair often result in NP-hard problems [3, 24]; in causality, too, finding the causes of an output is NP-hard [26]. To circumvent this difficulty in a general framework, we propose the adoption of SAT or MaxSAT solvers as general purpose tools in RDM. These solvers handle tens of thousands of variables and clauses, and for many practical instances even millions (cf. [1]), making them attractive for RDM, and they have recently been deployed for specific RDM problems [27]. We posit that, by using SAT or MaxSAT as primitives (oracles), many RDM problems become tractable in practice.

2. THE PROBLEM SPACE FOR RDM

We classify RDM problems along two dimensions.

Target Data. We distinguish between explicit target and implicit target specifications.
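The MaxSAT approach proposed in Section 1 can be sketched on a toy instance. The relations and tuple identifiers below are hypothetical, and we brute-force the optimum instead of calling a real solver; the point is only the shape of the encoding: one hard clause per derivation of an unwanted output, one unit soft clause per source tuple.

```python
from itertools import product

# Hypothetical toy instance: view V = R join S on a shared key attribute.
R = {("r1", "a"), ("r2", "a"), ("r3", "b")}   # (tuple id, key)
S = {("s1", "a"), ("s2", "b")}                # (tuple id, key)

# Goal: make the view tuple derived from r1 and s1 disappear. The single
# derivation yields a hard clause (del_r1 OR del_s1); each source tuple
# contributes a unit soft clause (NOT del_t), so the MaxSAT optimum
# deletes as few source tuples as possible. A real system would hand the
# clauses to a MaxSAT solver; we enumerate assignments here.
variables = sorted({t for (t, _) in R} | {t for (t, _) in S})
hard_clauses = [{"r1", "s1"}]    # at least one tuple of the pair must go

best = None
for bits in product([False, True], repeat=len(variables)):
    deleted = {v for v, b in zip(variables, bits) if b}
    if all(deleted & clause for clause in hard_clauses):   # hard part holds
        if best is None or len(deleted) < len(best):       # soft part: cost
            best = deleted

print(best)   # an optimal repair deletes exactly one of r1, s1
```

Any assignment satisfying the hard clauses is a feasible input modification; the soft clauses rank feasible modifications by cost, which is exactly the structure shared by the RDM problems discussed below.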
In the case of an explicit target, the output is given as a specific data instance. Sometimes there is a distinction between different versions of the target (e.g. before and after a view update), but it always involves tuple-level instances. In the case of an implicit target, the output is described indirectly, through constraints or statistics, as in constraint-based data cleaning or in declarative data generation. Since statistics and other constraints can also be viewed as transformations on the data, implicit descriptions can be transformed into explicit ones. However, we prefer to preserve the distinction, as we view the two cases as conceptually different: one is based on specific tuples, while the other on collective measures over an instance. In the former case, we know exactly what target data we aim for; in the latter case we only desire some general effect, usually given through some constraints or aggregate statistics. Thus, the two cases differ in what restrictions they impose on the target data. Since the reverse data management paradigm involves a “backward” derivation, from target to source, some specification of the target data always needs to be part of the problem description.

Figure 2: The space of Reverse Data Management, seen under source and target specifications.

                            reference source              no source data
    explicit specification  View updates, Provenance,     Inversion mappings,
                            Causality                     Abduction
    implicit specification  Constraint-based repair,      Data Generation
                            How-To Queries

Source Data. We distinguish between RDM problems with a reference source data and without source data. In the first case we have a source data and we want to modify it to reflect a desired effect on the output: updates through views is the classic example.
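This first case can be illustrated with a small sketch. The relations below and the search strategy (scan deletion sets of increasing size, prefer fewer side effects) are our own simplification for illustration, not an algorithm from the cited literature:

```python
from itertools import combinations

# Hypothetical instance: the view is defined by V(x) :- R(x, y), S(y).
R = {("a", 1), ("b", 1), ("c", 2)}
S = {1, 2}

def view(r, s):
    return {x for (x, y) in r if y in s}

base = view(R, S)
target = "a"                    # the user asks to delete "a" from the view

# Candidate actions: delete source tuples. Among the smallest feasible
# deletion sets, keep one whose side effects (other view tuples lost)
# are fewest, a simplification of side-effect minimization.
tuples = [("R", t) for t in R] + [("S", t) for t in S]

def apply_deletions(dels):
    r = {t for t in R if ("R", t) not in dels}
    s = {t for t in S if ("S", t) not in dels}
    return view(r, s)

best = None
for k in range(len(tuples) + 1):
    for dels in combinations(tuples, k):
        after = apply_deletions(set(dels))
        if target not in after:
            side_effects = (base - after) - {target}
            if best is None or len(side_effects) < len(best[1]):
                best = (set(dels), side_effects)
    if best is not None:
        break                   # stop at the smallest feasible deletion size

print(best)   # deleting R("a", 1) removes "a" with no side effects
```

Deleting S's tuple 1 would also remove "a" from the view, but at the cost of losing "b" as a side effect, which is why the search prefers deleting R("a", 1).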
In the second case, we have to compute the entire input from scratch: the goal of inverse schema mapping is to give a simple tool to compute the input from the output [18]; another example is data generation, where we do not have any input and need to construct one from scratch.

This classification along the two dimensions results in four types of RDM problems, which are illustrated in Fig. 2. Figure 2 depicts the reverse data management domain using the distinctions we made on source and target constraints. We classify some RDM problems using their most common form, but note that some of them could potentially be classified differently based on variations of the problem description. This exercise helps us identify an interesting new direction of this line of research that we call How-To queries, which we motivate and describe in the following section. We will cover in more detail the examples mentioned in Fig. 2, but this is not meant to be an exhaustive list. In the next section we describe a specific RDM problem that has not yet been explored in the literature, which falls under the fourth category.

View Updates. In the various flavors of the view update problem [10, 12, 13], a source dataset is given, along with a query that specifies a view over the data. In the classical formulation, the goal is to describe the class of views that are updateable. In other formulations of the problem, one wants to modify the source data to achieve the desired update, without introducing additional side-effects (or by minimizing them) [10]. The target data in this case is the view itself, and the problem is to determine how updates, insertions, and deletions to the view would be reflected in the source data. Thus, there is a reference source, and the output is explicit.

Data Provenance; Database Causality.
While data provenance [11, 9, 20] is computed “forwards”, from the source to the target, its purpose is often to enable users to trace information backwards: given an output tuple, its provenance describes which input tuples have contributed to it. Both target and source data are given in this problem statement, and the goal is to select only the relevant parts of the source that correspond to the target tuples of interest. Database causality [26, 25] is a refinement of data provenance, where the target tuples may deviate from what the user expects (unexpected tuples appear in the result, or expected ones do not), and the goal is to select the appropriate source tuples as causes of given target tuples. In both cases, the problem statement specifies an explicit target where the output is given as a specific data instance, and a source data instance from which the proper parts will be selected.

Inversion Mappings. Data exchange deals with a problem that often arises in data integration, where we need to transform data from one schema to another. A schema mapping is a specification that describes how data from a source schema A is to be mapped to a target schema B. The inversion of a mapping [4, 17, 2, 5] describes the reverse transformation from B to A, the goal being to recover the original source data. Various definitions of schema mappings and their inversions exist, but in general we still have an explicit specification of the target dataset; however, we do not necessarily have any reference source, as was the case in the problems we have covered so far.

Abduction. Abduction [15, 28] is a related problem where the goal is to find hypothetical explanations for an observed consequence. In the general case no reference source needs to be assumed, but abduction techniques have also been linked to repairs [6] and view updates [22] (possibly classifying it in a different box).

Data Generation.
In order to test the correctness and performance of algorithms and systems, synthetic data is often necessary, as it allows arbitrary scale-up and exhaustive exploration of properties and parameters that cannot always be found in real data. Generating synthetic data with meaningful characteristics that sufficiently resembles real data behavior is a hard and challenging problem in database research [21, 19, 8]. In data generation, there is usually no reference source data; rather, the data needs to be generated from scratch, commonly based on specific statistics. Some common statistics involve the sizes of relations and the numbers of unique attribute values, as well as correlations between them. A valid solution to the data generation problem is source data that satisfies all constraints given by the problem description. As opposed to the previous problem settings, the target here is implicitly specified using statistics and constraints, rather than a fully defined data instance.

Constraint-based Repair. In this problem we are given a source database instance and a target constraint, such as a key constraint or some conditional functional dependencies. The goal is to repair the source data in order to satisfy the constraint [3]. While the source data exists (reference source), there is no target data; instead there is only a target constraint, so the target specification is implicit.
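A deletion-based repair for the simplest such constraint, a key, can be sketched as follows. The employee table is hypothetical, and real repair systems must also consider value updates and insertions, not only deletions:

```python
from itertools import groupby

# Hypothetical dirty instance: the first attribute ("name") is a key,
# but it is violated by duplicate names.
emp = [("alice", "sales"), ("alice", "hr"), ("bob", "it")]

def repair_key(rows, key=lambda t: t[0]):
    """Deletion-based repair: keep one tuple per key group, delete the rest.

    For a single key constraint this deletes the minimum number of
    tuples, though which representative to keep is an arbitrary choice.
    """
    kept, deleted = [], []
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        kept.append(group[0])        # arbitrary representative of the group
        deleted.extend(group[1:])
    return kept, deleted

kept, deleted = repair_key(emp)
print(kept)      # the repaired instance satisfies the key constraint
print(deleted)   # cost of the repair: the deleted tuples
```

Note that the target here is never materialized: the repaired instance is any instance satisfying the constraint, which is exactly what makes the target specification implicit.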




Journal:
  • PVLDB

Volume 4, Issue 

Pages  -

Publication date: 2011